Hyperparameter Tuning with Optuna

With great models comes the great problem of optimizing hyperparameters [Tha20]. Once a good search algorithm is established for hyperparameter optimization, the task becomes an engineering problem 1. Hence, we will explore an open-source library that offers a framework for solving this task.

../_images/optuna.png

Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning. It features an imperative, define-by-run style user API. Thanks to this define-by-run API, code written with Optuna is highly modular, and the user can dynamically construct the search spaces for the hyperparameters.

Basics with scikit-learn

Optuna is a black-box optimizer: it needs only an objective function, that is, any function that returns a numerical value, to evaluate the performance of a set of hyperparameters and decide where to sample in upcoming trials. An optimization problem is framed in the Optuna API using two basic concepts: study and trial.

A study is conceptually an optimization based on an objective function, while a trial is a single execution of an objective function. The combination of hyperparameters for each trial is sampled according to some sampling algorithm defined by the study.

In the following code example, the search space is constructed within imperative Python code, e.g. inside conditionals or loops. In contrast, recall that for GridSearchCV and RandomizedSearchCV in scikit-learn, we had to define the entire search space before running the search algorithm.

!pip install optuna
import optuna
from functools import partial
from sklearn import datasets, ensemble, model_selection, svm


# [1] Define an objective function to be maximized.
def objective(trial, X, y):

    # [2] Suggest values for the hyperparameters using the trial object.
    #     The search space is built as the code runs (define-by-run).
    clf_name = trial.suggest_categorical('classifier', ['SVC', 'RandomForest'])
    if clf_name == 'SVC':
        # Log-uniform sampling. Optuna >= 3.0 deprecates suggest_loguniform
        # in favor of suggest_float(..., log=True).
        svc_c = trial.suggest_loguniform('svc_c', 1e-10, 1e10)
        clf = svm.SVC(C=svc_c, gamma='auto')
    else:
        # Sampled as a float, then truncated to an integer depth.
        rf_max_depth = int(trial.suggest_loguniform('rf_max_depth', 2, 32))
        clf = ensemble.RandomForestClassifier(max_depth=rf_max_depth, n_estimators=10)

    # Mean 5-fold cross-validation accuracy is the value to maximize.
    score = model_selection.cross_val_score(clf, X, y, n_jobs=-1, cv=5)
    return score.mean()

# [3] Create a study object and optimize the objective function.
X, y = datasets.load_breast_cancer(return_X_y=True)
study = optuna.create_study(direction="maximize")
study.optimize(partial(objective, X=X, y=y), n_trials=5)
[I 2021-09-23 17:54:39,422] A new study created in memory with name: no-name-e4f2c3f5-5d18-48a2-9e80-803c7890c30c
[I 2021-09-23 17:54:40,580] Trial 0 finished with value: 0.9525229001707809 and parameters: {'classifier': 'RandomForest', 'rf_max_depth': 3.0117830841670483}. Best is trial 0 with value: 0.9525229001707809.
[I 2021-09-23 17:54:40,684] Trial 1 finished with value: 0.6274181027790716 and parameters: {'classifier': 'SVC', 'svc_c': 15275857.681467317}. Best is trial 0 with value: 0.9525229001707809.
[I 2021-09-23 17:54:40,783] Trial 2 finished with value: 0.6274181027790716 and parameters: {'classifier': 'SVC', 'svc_c': 297357093.2564345}. Best is trial 0 with value: 0.9525229001707809.
[I 2021-09-23 17:54:40,884] Trial 3 finished with value: 0.6274181027790716 and parameters: {'classifier': 'SVC', 'svc_c': 241127.03816762505}. Best is trial 0 with value: 0.9525229001707809.
[I 2021-09-23 17:54:41,005] Trial 4 finished with value: 0.9543238627542306 and parameters: {'classifier': 'RandomForest', 'rf_max_depth': 6.143111746204174}. Best is trial 4 with value: 0.9543238627542306.

The study object saves the result of evaluating the objective for each trial, where a trial is essentially one choice of hyperparameters to evaluate. In the above study, the problem of model selection is framed as a hyperparameter optimization problem: we choose between an SVM-based classifier and a random forest.

study.trials_dataframe().head()
   number     value              datetime_start           datetime_complete                duration  params_classifier  params_rf_max_depth  params_svc_c     state
0       0  0.952523  2021-09-23 17:54:39.425578  2021-09-23 17:54:40.579898  0 days 00:00:01.154320       RandomForest             3.011783           NaN  COMPLETE
1       1  0.627418  2021-09-23 17:54:40.582807  2021-09-23 17:54:40.684042  0 days 00:00:00.101235                SVC                  NaN  1.527586e+07  COMPLETE
2       2  0.627418  2021-09-23 17:54:40.685710  2021-09-23 17:54:40.782860  0 days 00:00:00.097150                SVC                  NaN  2.973571e+08  COMPLETE
3       3  0.627418  2021-09-23 17:54:40.784771  2021-09-23 17:54:40.884335  0 days 00:00:00.099564                SVC                  NaN  2.411270e+05  COMPLETE
4       4  0.954324  2021-09-23 17:54:40.886240  2021-09-23 17:54:41.004901  0 days 00:00:00.118661       RandomForest             6.143112           NaN  COMPLETE
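Once the study has finished, the winning configuration usually has to be turned back into an estimator. The sketch below shows one way to do that. The helper `build_classifier` is hypothetical (it is not part of Optuna or scikit-learn); it simply mirrors the branches of the objective and relies on the parameter names `classifier`, `svc_c` and `rf_max_depth` used there. The hold-out split is also an assumption made for the demonstration.

```python
from sklearn import datasets, ensemble, model_selection, svm


def build_classifier(params):
    """Rebuild an unfitted estimator from a params dict (hypothetical helper)."""
    if params["classifier"] == "SVC":
        return svm.SVC(C=params["svc_c"], gamma="auto")
    # rf_max_depth was sampled as a float, so truncate it as the objective does.
    return ensemble.RandomForestClassifier(
        max_depth=int(params["rf_max_depth"]), n_estimators=10
    )


# Parameters copied from trial 4 in the log above. With a live study,
# pass study.best_params instead.
best_params = {"classifier": "RandomForest", "rf_max_depth": 6.143111746204174}

X_demo, y_demo = datasets.load_breast_cancer(return_X_y=True)
X_trn, X_tst, y_trn, y_tst = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
final_clf = build_classifier(best_params).fit(X_trn, y_trn)
print(final_clf.score(X_tst, y_tst))
```

Keeping the construction logic in one helper avoids the objective and the final training code drifting apart when the search space changes.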

Fine tuning Random Forest

Here we focus on tuning a single Random Forest model. We then plot the accuracy for each pair of hyperparameters.

def objective(trial):
    
    max_depth = trial.suggest_int('max_depth', 2, 128, log=True)    
    max_features = trial.suggest_float('max_features', 0.1, 1.0)    
    n_estimators = trial.suggest_int('n_estimators', 100, 800)
    
    clf = ensemble.RandomForestClassifier(
        max_depth=max_depth,
        n_estimators=n_estimators,
        max_features=max_features,
        random_state=42)   
    
    score = model_selection.cross_val_score(clf, X, y, n_jobs=-1, cv=5)
    return score.mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
[I 2021-09-23 17:54:41,088] A new study created in memory with name: no-name-1118f726-3850-4750-9359-55c0ea45a8f8
[I 2021-09-23 17:54:48,956] Trial 0 finished with value: 0.9578481602235678 and parameters: {'max_depth': 8, 'max_features': 0.13883223749967205, 'n_estimators': 581}. Best is trial 0 with value: 0.9578481602235678.
[I 2021-09-23 17:54:55,549] Trial 1 finished with value: 0.9596491228070174 and parameters: {'max_depth': 119, 'max_features': 0.8704051169259739, 'n_estimators': 188}. Best is trial 1 with value: 0.9596491228070174.
[I 2021-09-23 17:55:13,415] Trial 2 finished with value: 0.9596180717279925 and parameters: {'max_depth': 44, 'max_features': 0.6700603339499863, 'n_estimators': 698}. Best is trial 1 with value: 0.9596491228070174.
[I 2021-09-23 17:55:20,477] Trial 3 finished with value: 0.9613724576929048 and parameters: {'max_depth': 9, 'max_features': 0.228797623879799, 'n_estimators': 717}. Best is trial 3 with value: 0.9613724576929048.
[I 2021-09-23 17:55:29,341] Trial 4 finished with value: 0.9613724576929048 and parameters: {'max_depth': 29, 'max_features': 0.6566210378425625, 'n_estimators': 580}. Best is trial 3 with value: 0.9613724576929048.
[I 2021-09-23 17:55:31,665] Trial 5 finished with value: 0.9578792113025927 and parameters: {'max_depth': 10, 'max_features': 0.9737412231250336, 'n_estimators': 119}. Best is trial 3 with value: 0.9613724576929048.
[I 2021-09-23 17:55:34,767] Trial 6 finished with value: 0.9490607048594939 and parameters: {'max_depth': 2, 'max_features': 0.16966704315981518, 'n_estimators': 371}. Best is trial 3 with value: 0.9613724576929048.
[I 2021-09-23 17:55:40,593] Trial 7 finished with value: 0.956078248719143 and parameters: {'max_depth': 4, 'max_features': 0.16266334215164907, 'n_estimators': 668}. Best is trial 3 with value: 0.9613724576929048.
[I 2021-09-23 17:55:42,917] Trial 8 finished with value: 0.9578326346840553 and parameters: {'max_depth': 57, 'max_features': 0.16917559497644086, 'n_estimators': 243}. Best is trial 3 with value: 0.9613724576929048.
[I 2021-09-23 17:55:48,998] Trial 9 finished with value: 0.95960254618848 and parameters: {'max_depth': 6, 'max_features': 0.2007403313381485, 'n_estimators': 615}. Best is trial 3 with value: 0.9613724576929048.
study.best_params
{'max_depth': 9, 'max_features': 0.228797623879799, 'n_estimators': 717}
study.best_value
0.9613724576929048

Sampling algorithms

import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=1, ncols=3)

def plot_results(study, p1, p2, j, cb):
    study.trials_dataframe().plot(
        kind='scatter', ax=axes[j], x=p1, y=p2,
        c='value', s=60, cmap=plt.get_cmap("jet"), 
        colorbar=cb, label="accuracy", figsize=(16, 4)
    )

plot_results(study, 'params_max_depth',    'params_n_estimators', j=0, cb=False)
plot_results(study, 'params_max_depth',    'params_max_features', j=1, cb=False)
plot_results(study, 'params_n_estimators', 'params_max_features', j=2, cb=True);
../_images/hyperopt_optuna3_16_0.png

Figure. TPE in action. Optuna's default sampler is the Tree-structured Parzen Estimator (TPE) [BBBK11], which is a form of Bayesian optimization. Observe that the sampler tends to choose points close to previous good results, so the hyperparameter space is searched more efficiently than with random search. Samplers are specified when creating a study:

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())

From the docs:

On each trial, for each parameter, TPE fits one Gaussian Mixture Model (GMM) l(x) to the set of parameter values associated with the best objective values, and another GMM g(x) to the remaining parameter values. It chooses the parameter value x that maximizes the ratio l(x)/g(x).

Thus, TPE samples every hyperparameter independently: no explicit hyperparameter interactions are considered when sampling future trials, although the other parameters implicitly affect the objective value. Optuna also implements our old friends random search and grid search in the following samplers:

  • optuna.samplers.GridSampler

  • optuna.samplers.RandomSampler
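To make the l(x)/g(x) rule quoted from the docs more concrete, here is a one-dimensional sketch in plain NumPy. It is an illustration under simplifying assumptions, not Optuna's implementation: the fixed `bandwidth`, the quantile `gamma`, the number of candidates and the helper names `kde` and `tpe_suggest` are all made up for this example, and the real TPESampler handles priors, weights and bandwidths differently.

```python
import numpy as np


def kde(points, bandwidth):
    """Return a density function: an equal-weight mixture of Gaussians."""
    points = np.asarray(points, dtype=float)

    def density(x):
        z = (np.asarray(x, dtype=float)[:, None] - points[None, :]) / bandwidth
        norm = len(points) * bandwidth * np.sqrt(2.0 * np.pi)
        return np.exp(-0.5 * z**2).sum(axis=1) / norm

    return density


def tpe_suggest(xs, ys, rng, gamma=0.25, bandwidth=0.5, n_candidates=24):
    """Suggest the next x for a minimization problem with a TPE-style rule."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    n_good = max(1, int(np.ceil(gamma * len(xs))))
    order = np.argsort(ys)  # ascending, so the best (lowest) values come first
    good, bad = xs[order[:n_good]], xs[order[n_good:]]
    l, g = kde(good, bandwidth), kde(bad, bandwidth)
    # Draw candidates from l(x): pick a good point and add Gaussian noise.
    noise = bandwidth * rng.normal(size=n_candidates)
    candidates = rng.choice(good, size=n_candidates) + noise
    ratio = l(candidates) / (g(candidates) + 1e-12)
    return float(candidates[np.argmax(ratio)])


def f(x):
    return (x - 2.0) ** 2


rng = np.random.default_rng(0)
xs = list(rng.uniform(-10.0, 10.0, size=10))  # random startup trials
ys = [f(x) for x in xs]
for _ in range(40):
    x_next = tpe_suggest(xs, ys, rng)
    xs.append(x_next)
    ys.append(f(x_next))

print(min(ys))
```

With several hyperparameters, this procedure would be repeated separately for each one, which is why no explicit interactions enter the sampling step.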

Results from the paper [ASY+19]:

../_images/fig9-optuna.png
../_images/fig10-optuna.png
../_images/optuna-results.png


TPE+CMA-ES sampling can be implemented as follows:

sampler = optuna.samplers.CmaEsSampler(
    warn_independent_sampling=False,
    independent_sampler=optuna.samplers.TPESampler()
)

This uses the CMA-ES algorithm [Han16], falling back to TPE for dynamically constructed hyperparameters, since CMA-ES requires that parameters are specified prior to the optimization.

Visualizations

First define a helper function for displaying plotly plots as HTML.

from IPython.display import display, HTML
from plotly.offline import init_notebook_mode, plot
init_notebook_mode(connected=True)
config={'showLink': False, 'displayModeBar': False}
fig_count = 0

# See https://github.com/executablebooks/jupyter-book/issues/93
# Solves issue of having blank plotly plots in the build. No need to
# save the generated HTML files. Probably embedded into the notebook.
def plot_html(fig):
    global fig_count
    plot(fig, filename=f'optuna-{fig_count}.html', config=config)
    display(HTML(f'optuna-{fig_count}.html'))
    fig_count += 1

Optuna provides visualization functions in the optuna.visualization library 2. The following plot shows the best objective value found as the trials progress. The increasing trend in accuracy indicates that the TPE sampler is working well, i.e. the search algorithm learns from previous trials.

optuna.visualization.plot_optimization_history(study)

The parallel coordinate plot gives us a feel for how the hyperparameters interact. For instance, max_features around 0.5 with n_estimators around 280 and max_depth around 20 generally performs well. This setting includes the best-performing hyperparameters. To isolate subsets of lines, drag along an axis of the interactive plot to restrict its range, as in the figure immediately below.

optuna.visualization.plot_parallel_coordinate(study)
../_images/optuna-restrict-rf.png

Fig. 8 Using sliders to restrict values for certain parameters.

Slice plots project the path of the optimizer in the hyperparameter space onto each dimension, with each point placed along the \(y\)-axis according to its objective value. A large spread of dark dots indicates that a large range of values of that hyperparameter is feasible even at later stages. Meanwhile, a small spread means that the sampler focuses on a small part of the search space; in this case, other hyperparameters implicitly improve the objective. For example, the parameter max_features is explored over a wide range even in later trials. Hence, we think of this hyperparameter as important. Indeed, the importance plot below supports this.

plot_html(optuna.visualization.plot_slice(study, params=['n_estimators', 'max_depth', 'max_features']))

By default, the hyperparameter importance evaluator in Optuna is optuna.importance.FanovaImportanceEvaluator. This takes as input performance data gathered with different hyperparameter settings of the algorithm, fits a random forest to capture the relationship between hyperparameters and performance, and then applies functional ANOVA to assess how important each of the hyperparameters and each low-order interaction of hyperparameters is to performance [HHLB14]. From the docs:

The performance of fANOVA depends on the prediction performance of the underlying random forest model. In order to obtain high prediction performance, it is necessary to cover a wide range of the hyperparameter search space. It is recommended to use an exploration-oriented sampler such as RandomSampler.

fig = optuna.visualization.plot_param_importances(study)
fig.update_layout(width=600, height=350)
plot_html(fig)

To visualize interactions of any pair of hyperparameters, we use contour plots. The contour plots indicate regions of high and low objective value.

fig = optuna.visualization.plot_contour(study, params=["max_depth", "max_features"])
fig.update_layout(width=550, height=500)
plot_html(fig)

Neural networks

As noted above, we should always perform tuning within a cross-validation framework. However, with neural networks, 5-fold CV would require too much compute time and hence too many resources, e.g. GPU usage. Instead, we perform tuning on a hold-out validation set and hope for the best.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torch.utils.data import Dataset, DataLoader

from sklearn import model_selection
from sklearn.datasets import fetch_openml

from tqdm import tqdm
import optuna
import numpy as np

Define a simple network.

class MLPClassifier(nn.Module):
    """
    Neural network with multiple hidden fully-connected layers with ReLU 
    activation and dropout.
    """
    
    def __init__(self, input_size, num_classes, n_layers, out_features, drop_rate):
        super().__init__()
        layers = []
        in_features = input_size
        for i in range(n_layers):

            m = nn.Linear(in_features, out_features[i])
            nn.init.kaiming_normal_(m.weight)
            nn.init.constant_(m.bias, 0)

            layers.append(m)
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(drop_rate))

            in_features = out_features[i]

        layers.append(nn.Linear(in_features, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

We also define a Dataset class for MNIST.

class MNISTDataset(Dataset):
    def __init__(self, features, targets, transform=None):
        self.features = features
        self.targets = targets
        self.transform = transform
        
    def __len__(self):
        return self.features.shape[0]
    
    def __getitem__(self, i):
        X = self.features[i, :]
        y = self.targets[i]
        
        if self.transform is not None:
            X = self.transform(X)
            
        return X, y

Define a trainer for the neural network model. This will handle all loss and metric evaluation, as well as backpropagation.

class Engine:
    """Neural network trainer."""
    
    def __init__(self, model, device, optimizer):
        self.model = model
        self.device = device
        self.optimizer = optimizer 

    @staticmethod
    def loss_fn(outputs, targets):
        return nn.CrossEntropyLoss()(outputs, targets)
        
    def train(self, data_loader):
        """Train model on one epoch. Return train loss."""
        
        self.model.train()
        loss = 0
        for i, (data, targets) in enumerate(data_loader):
            data = data.to(self.device).reshape(data.shape[0], -1).float()
            targets = targets.to(self.device).long()
            
            # Forward pass
            outputs = self.model(data)
            J = self.loss_fn(outputs, targets)
            
            # Backward pass
            self.optimizer.zero_grad()
            J.backward()
            self.optimizer.step()

            # Cumulative loss
            loss += (J.detach().item() - loss) / (i + 1)

        return loss


    def eval(self, data_loader):
        """Return validation loss and validation accuracy."""
        
        self.model.eval()
        num_correct = 0
        num_samples = 0
        loss = 0.0
        with torch.no_grad():
            for i, (data, targets) in enumerate(data_loader):
                data = data.to(self.device).float()
                targets = targets.to(self.device)
                
                # Forward pass
                data = data.reshape(data.shape[0], -1)
                out = self.model(data)
                J = self.loss_fn(out, targets)
                _, preds = out.max(dim=1)

                # Cumulative metrics
                loss += (J.detach().item() - loss) / (i + 1)
                num_correct += (preds == targets).sum().item()
                num_samples += preds.shape[0]

        acc = num_correct / num_samples
        return loss, acc
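The update `loss += (J - loss) / (i + 1)` used in both `Engine.train` and `Engine.eval` is an incremental mean over batches. The short check below, with made-up batch losses, shows that it agrees with the ordinary average.

```python
batch_losses = [0.9, 0.7, 0.4, 0.6]  # made-up per-batch losses

running = 0.0
for i, batch_loss in enumerate(batch_losses):
    # Same update as in Engine.train and Engine.eval.
    running += (batch_loss - running) / (i + 1)

plain_mean = sum(batch_losses) / len(batch_losses)
print(running, plain_mean)
```

Note that every batch carries the same weight, so when the last batch is smaller than the others the result differs slightly from the per-sample mean loss.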

Next comes some configuration and setup prior to training. For our dataset, we use MNIST, which we fetch from OpenML through scikit-learn.

# Config
RANDOM_STATE = 42
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
EPOCHS = 100
PATIENCE = 5
INPUT_SIZE = 784
NUM_CLASSES = 10

# Fetch data
# as_frame=False returns NumPy arrays, which the reshape below requires
MNIST = fetch_openml("mnist_784", as_frame=False)
X = MNIST['data'].reshape(-1, 28, 28)
y = MNIST['target'].astype(int)

# Create folds
cv = model_selection.StratifiedKFold(n_splits=5)
trn_, val_ = next(iter(cv.split(X=X, y=y)))

# Get train and valid data loaders
train_dataset = MNISTDataset(X[trn_, :], y[trn_], transform=transforms.ToTensor())
valid_dataset = MNISTDataset(X[val_, :], y[val_], transform=transforms.ToTensor())

Intermediate values

Finally, we set up the study instance and its objective function. Note that the search space is constructed dynamically: the number of layers (an earlier suggestion) determines how many layer-width hyperparameters are suggested. During training, we perform early stopping on the validation loss. If no new minimum validation loss is found for 5 consecutive epochs, training stops and the minimum validation loss is returned as the objective 3.

Reporting intermediate values allows us to prune unpromising trials to conserve resources. The default pruner in Optuna is optuna.pruners.MedianPruner, which prunes a trial if its best intermediate result as of the current step (e.g. the current best validation loss) is worse than the median of the intermediate results of previous trials at the same step. Hence, the best intermediate result of a pruned trial is worse than the intermediate result of at least half of the previous trials at that step. In our case, if the minimum validation loss does not improve quickly enough, the trial is pruned. Of course, the validation loss could descend rapidly at later steps, but the median pruner does not bet on this happening.
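The median rule itself fits in a few lines. The sketch below is a simplified stand-in for a minimized metric, not Optuna's code: the function name `should_prune_median` and the numbers are made up, and the real pruner additionally handles warm-up steps, missing reports and maximization.

```python
from statistics import median


def should_prune_median(best_so_far, previous_at_step, n_startup_trials=5):
    """Median rule for a minimized metric (simplified sketch)."""
    if len(previous_at_step) < n_startup_trials:
        return False  # not enough history yet, never prune
    return best_so_far > median(previous_at_step)


earlier = [0.30, 0.25, 0.40, 0.22, 0.35]  # made-up values at the same epoch
print(should_prune_median(0.50, earlier))  # True: worse than the median 0.30
print(should_prune_median(0.20, earlier))  # False: better than the median
```

A trial whose best loss so far is 0.50 is stopped because half of the earlier trials had already reached 0.30 or lower at this step.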

def define_model(trial):
  
    # Optimize the # of layers, hidden units and dropout ratio in each layer.
    n_layers = trial.suggest_int("n_layers", 1, 3)
    out_features = []
    drop_rate = trial.suggest_float('dropout_rate', 0.2, 0.5)
    for i in range(n_layers):
        out_features.append(trial.suggest_int("n_units_l{}".format(i), 4, 128))

    return MLPClassifier(INPUT_SIZE, NUM_CLASSES, n_layers, out_features, drop_rate)


def objective(trial):

    model = define_model(trial).to(DEVICE)
    batch_size = trial.suggest_int('batch_size', 8, 512, log=True)
    learning_rate = trial.suggest_loguniform('lr', 1e-5, 1e-1)
    weight_decay = trial.suggest_float('weight_decay', 0.0, 0.5)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=3)
    engine = Engine(model, DEVICE, optimizer)

    # Init. dataloaders
    train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    valid_loader = DataLoader(dataset=valid_dataset, batch_size=batch_size, shuffle=True)
    
    # Run training
    best_loss = np.inf
    patience = PATIENCE
    for epoch in tqdm(range(EPOCHS), total=EPOCHS, leave=False):

        # Train and validation step
        train_loss = engine.train(train_loader)
        valid_loss, valid_acc = engine.eval(valid_loader)

        # Reduce learning rate
        if scheduler is not None:
            scheduler.step(valid_loss)
            
        # Early stopping
        if valid_loss < best_loss:
            best_loss = valid_loss
            patience = PATIENCE
        else:
            patience -= 1
            if patience == 0:
                break
    
        # Pruning unpromising trials
        trial.report(valid_loss, step=epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return best_loss

# Create and run optimization problem 
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=60)
[I 2021-09-23 16:35:45,054] A new study created in memory with name: no-name-12007090-de83-452d-95f7-6afe312869a9
[I 2021-09-23 16:36:55,854] Trial 0 finished with value: 0.21022988284386487 and parameters: {'n_layers': 2, 'dropout_rate': 0.27948169944648965, 'n_units_l0': 88, 'n_units_l1': 115, 'batch_size': 137, 'lr': 0.008514613889742123, 'weight_decay': 0.3670069593807663}. Best is trial 0 with value: 0.21022988284386487.
[I 2021-09-23 16:42:48,011] Trial 1 finished with value: 0.08726624973227211 and parameters: {'n_layers': 1, 'dropout_rate': 0.4338336474302041, 'n_units_l0': 125, 'batch_size': 21, 'lr': 0.0026570069013367647, 'weight_decay': 0.18948308621002236}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:44:59,376] Trial 2 finished with value: 0.1323056104992117 and parameters: {'n_layers': 2, 'dropout_rate': 0.21647904438644167, 'n_units_l0': 104, 'n_units_l1': 75, 'batch_size': 201, 'lr': 2.821174991898835e-05, 'weight_decay': 0.1794091322194259}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:45:53,755] Trial 3 finished with value: 0.22288295945950917 and parameters: {'n_layers': 1, 'dropout_rate': 0.47440500392496415, 'n_units_l0': 28, 'batch_size': 250, 'lr': 0.0001789132830643754, 'weight_decay': 0.27924720478580706}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:48:09,581] Trial 4 finished with value: 2.3010145167897114 and parameters: {'n_layers': 3, 'dropout_rate': 0.28055390550281156, 'n_units_l0': 113, 'n_units_l1': 25, 'n_units_l2': 21, 'batch_size': 13, 'lr': 0.014257299782954121, 'weight_decay': 0.13705041929131956}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:51:34,557] Trial 5 pruned. 
[I 2021-09-23 16:52:43,213] Trial 6 finished with value: 0.1177581399365255 and parameters: {'n_layers': 1, 'dropout_rate': 0.4482671105170357, 'n_units_l0': 113, 'batch_size': 79, 'lr': 0.00015027733498346874, 'weight_decay': 0.3609538112622472}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:52:53,391] Trial 7 pruned. 
[I 2021-09-23 16:53:05,382] Trial 8 pruned. 
[I 2021-09-23 16:54:19,755] Trial 9 finished with value: 0.1301146842344524 and parameters: {'n_layers': 1, 'dropout_rate': 0.43234779233219023, 'n_units_l0': 77, 'batch_size': 92, 'lr': 0.0001244240643891887, 'weight_decay': 0.42219789050012224}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:54:25,639] Trial 10 pruned. 
[I 2021-09-23 16:55:02,303] Trial 11 finished with value: 0.10693867458030581 and parameters: {'n_layers': 1, 'dropout_rate': 0.38676795279677584, 'n_units_l0': 120, 'batch_size': 445, 'lr': 0.0009051925392143345, 'weight_decay': 0.30326382618609626}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:55:51,786] Trial 12 pruned. 
[I 2021-09-23 16:56:26,328] Trial 13 finished with value: 0.12493549846112731 and parameters: {'n_layers': 1, 'dropout_rate': 0.3550429345374202, 'n_units_l0': 126, 'batch_size': 503, 'lr': 0.0012205051681880804, 'weight_decay': 0.4696091813861093}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:56:42,622] Trial 14 pruned. 
[I 2021-09-23 16:56:43,755] Trial 15 pruned. 
[I 2021-09-23 16:56:48,863] Trial 16 pruned. 
[I 2021-09-23 16:56:53,308] Trial 17 pruned. 
[I 2021-09-23 16:57:10,668] Trial 18 pruned. 
[I 2021-09-23 16:57:18,708] Trial 19 pruned. 
[I 2021-09-23 16:57:19,972] Trial 20 pruned. 
[I 2021-09-23 16:57:22,372] Trial 21 pruned. 
[I 2021-09-23 16:58:29,770] Trial 22 finished with value: 0.11929829030444739 and parameters: {'n_layers': 1, 'dropout_rate': 0.4505547548269105, 'n_units_l0': 127, 'batch_size': 74, 'lr': 0.00017875642777059658, 'weight_decay': 0.4075948726688359}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:58:42,515] Trial 23 pruned. 
[I 2021-09-23 16:59:12,460] Trial 24 pruned. 
[I 2021-09-23 16:59:14,469] Trial 25 pruned. 
[I 2021-09-23 17:00:27,573] Trial 26 finished with value: 0.08047253501394556 and parameters: {'n_layers': 1, 'dropout_rate': 0.36885178395967316, 'n_units_l0': 108, 'batch_size': 196, 'lr': 0.00031607870752817664, 'weight_decay': 0.14341725932871502}. Best is trial 26 with value: 0.08047253501394556.
[I 2021-09-23 17:00:28,679] Trial 27 pruned. 
[I 2021-09-23 17:00:30,182] Trial 28 pruned. 
[I 2021-09-23 17:00:31,414] Trial 29 pruned. 
[I 2021-09-23 17:00:43,521] Trial 30 pruned. 
[I 2021-09-23 17:00:44,725] Trial 31 pruned. 
[I 2021-09-23 17:02:07,400] Trial 32 pruned. 
[I 2021-09-23 17:02:08,869] Trial 33 pruned. 
[I 2021-09-23 17:02:10,157] Trial 34 pruned. 
[I 2021-09-23 17:06:37,772] Trial 35 finished with value: 0.08286822465960587 and parameters: {'n_layers': 1, 'dropout_rate': 0.2566693607067595, 'n_units_l0': 92, 'batch_size': 33, 'lr': 0.00023038433623588474, 'weight_decay': 0.17907560196864042}. Best is trial 26 with value: 0.08047253501394556.
[I 2021-09-23 17:07:00,968] Trial 36 pruned. 
[I 2021-09-23 17:07:11,660] Trial 37 pruned. 
[I 2021-09-23 17:11:14,050] Trial 38 finished with value: 0.07970190724682759 and parameters: {'n_layers': 1, 'dropout_rate': 0.2550241291506312, 'n_units_l0': 107, 'batch_size': 27, 'lr': 0.000263494124965755, 'weight_decay': 0.16351684276444545}. Best is trial 38 with value: 0.07970190724682759.
[I 2021-09-23 17:11:20,992] Trial 39 pruned. 
[I 2021-09-23 17:11:26,819] Trial 40 pruned. 
[I 2021-09-23 17:12:09,104] Trial 41 pruned. 
[I 2021-09-23 17:18:54,125] Trial 42 finished with value: 0.09131516358004889 and parameters: {'n_layers': 1, 'dropout_rate': 0.3099118013642247, 'n_units_l0': 122, 'batch_size': 13, 'lr': 0.00012731056966105474, 'weight_decay': 0.2348841683626428}. Best is trial 38 with value: 0.07970190724682759.
[I 2021-09-23 17:27:46,436] Trial 43 pruned. 
[I 2021-09-23 17:29:19,866] Trial 44 pruned. 
[I 2021-09-23 17:29:28,206] Trial 45 pruned. 
[I 2021-09-23 17:37:31,618] Trial 46 finished with value: 0.08286373784078745 and parameters: {'n_layers': 1, 'dropout_rate': 0.2369304595931093, 'n_units_l0': 116, 'batch_size': 14, 'lr': 9.466146482797484e-05, 'weight_decay': 0.20151296294549714}. Best is trial 38 with value: 0.07970190724682759.
[I 2021-09-23 17:37:35,168] Trial 47 pruned. 
[I 2021-09-23 17:37:39,402] Trial 48 pruned. 
[I 2021-09-23 17:39:33,132] Trial 49 finished with value: 0.0823763620058915 and parameters: {'n_layers': 1, 'dropout_rate': 0.20064752116941653, 'n_units_l0': 104, 'batch_size': 62, 'lr': 0.00040245052952413314, 'weight_decay': 0.18911575126002592}. Best is trial 38 with value: 0.07970190724682759.
[I 2021-09-23 17:39:35,717] Trial 50 pruned. 
[I 2021-09-23 17:40:02,287] Trial 51 pruned. 
[I 2021-09-23 17:40:29,191] Trial 52 pruned. 
[I 2021-09-23 17:41:12,289] Trial 53 pruned. 
[I 2021-09-23 17:41:28,513] Trial 54 pruned. 
[I 2021-09-23 17:44:22,251] Trial 55 finished with value: 0.06355256314022178 and parameters: {'n_layers': 1, 'dropout_rate': 0.25711536163064735, 'n_units_l0': 117, 'batch_size': 57, 'lr': 0.0003044584963559317, 'weight_decay': 0.048125362770925995}. Best is trial 55 with value: 0.06355256314022178.
[I 2021-09-23 17:44:25,394] Trial 56 pruned. 
[I 2021-09-23 17:44:28,221] Trial 57 pruned. 
[I 2021-09-23 17:44:42,776] Trial 58 pruned. 
[I 2021-09-23 17:44:48,257] Trial 59 pruned. 
from optuna.trial import TrialState

pruned_trials = study.get_trials(deepcopy=False, states=[TrialState.PRUNED])
complete_trials = study.get_trials(deepcopy=False, states=[TrialState.COMPLETE])

print("Study statistics: ")
print("  Number of finished trials:\t", len(study.trials))
print("  Number of pruned trials:\t", len(pruned_trials))
print("  Number of complete trials:\t", len(complete_trials))

print("\nBest trial:")
trial = study.best_trial

print("  Value: ", trial.value)
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))
Study statistics: 
  Number of finished trials:	 60
  Number of pruned trials:	 43
  Number of complete trials:	 17

Best trial:
  Value:  0.06355256314022178
  Params: 
    n_layers: 1
    dropout_rate: 0.25711536163064735
    n_units_l0: 117
    batch_size: 57
    lr: 0.0003044584963559317
    weight_decay: 0.048125362770925995

Each trial in the plots below either stops early (gradient descent loses momentum) or gets pruned (unlikely to improve even if gradient descent continues). Note that pruning starts at Trial 5. This is controlled by the n_startup_trials=5 parameter of the pruner: pruning is disabled until 5 trials have finished in the same study, so that the pruner has enough information about the behavior of the gradient descent optimizer before it starts to prune.

plot_html(optuna.visualization.plot_intermediate_values(study))
plot_html(optuna.visualization.plot_optimization_history(study))

Hyperparameter interactions

We use the parallel coordinate plot to see which combinations of hyperparameters work well. Note that something odd is going on here: trials with n_layers=1 have coordinates on axes where they should have no values, e.g. n_units_l1 and n_units_l2. This is a known issue for parallel plots, e.g. #1809. Lines for trials whose dynamically constructed parameters are NaN should be skipped by the plotter. Moreover, trials with NaN values are excluded from the parameter importance computation, which limits its usefulness.

plot_html(optuna.visualization.plot_parallel_coordinate(study))
study.trials_dataframe().head()
number value datetime_start datetime_complete duration params_batch_size params_dropout_rate params_lr params_n_layers params_n_units_l0 params_n_units_l1 params_n_units_l2 params_weight_decay state
0 0 0.210230 2021-09-23 16:35:45.058280 2021-09-23 16:36:55.853553 0 days 00:01:10.795273 137 0.279482 0.008515 2 88 115.0 NaN 0.367007 COMPLETE
1 1 0.087266 2021-09-23 16:36:55.857589 2021-09-23 16:42:48.010031 0 days 00:05:52.152442 21 0.433834 0.002657 1 125 NaN NaN 0.189483 COMPLETE
2 2 0.132306 2021-09-23 16:42:48.014477 2021-09-23 16:44:59.375034 0 days 00:02:11.360557 201 0.216479 0.000028 2 104 75.0 NaN 0.179409 COMPLETE
3 3 0.222883 2021-09-23 16:44:59.381142 2021-09-23 16:45:53.755409 0 days 00:00:54.374267 250 0.474405 0.000179 1 28 NaN NaN 0.279247 COMPLETE
4 4 2.301015 2021-09-23 16:45:53.757336 2021-09-23 16:48:09.580522 0 days 00:02:15.823186 13 0.280554 0.014257 3 113 25.0 21.0 0.137050 COMPLETE
study.trials_dataframe().query("state=='COMPLETE'").params_n_layers.value_counts()
1    14
2     2
3     1
Name: params_n_layers, dtype: int64

Instead, we can look at each subset of trials for different values of n_layers. The resulting trials have no NaN parameters, since the remaining parameters are sampled only after a value for n_layers has been suggested. It looks like n_layers=1 works best.

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Isolate a study for each value of n_layers
studies = [optuna.create_study() for j in range(3)]
for j in range(3):
    studies[j].add_trials([t for t in study.trials if t.params['n_layers'] == j+1])
    fig = optuna.visualization.plot_parallel_coordinate(studies[j])
    plot_html(fig)
[I 2021-09-23 17:44:51,716] A new study created in memory with name: no-name-6ca29b5a-3524-441a-b429-26b0a39e11f5
[I 2021-09-23 17:44:51,722] A new study created in memory with name: no-name-dacfd576-1b50-4842-8153-2932fe6db7eb
[I 2021-09-23 17:44:51,728] A new study created in memory with name: no-name-7183fded-a3b8-4c5d-b803-241191ad9c25
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:7: ExperimentalWarning:

add_trials is experimental (supported from v2.5.0). The interface can change in the future.

/usr/local/lib/python3.7/dist-packages/optuna/study/study.py:969: ExperimentalWarning:

add_trial is experimental (supported from v2.0.0). The interface can change in the future.
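As a numerical complement to the per-n_layers plots, the best value for each depth can be read off the trials dataframe with a groupby. The sketch below uses only the five rows printed by study.trials_dataframe().head() above, typed in by hand as a stand-in for the full dataframe. The names head and best_per_depth are illustrative.

```python
import pandas as pd

# First five trials, copied from the trials_dataframe().head() output above.
head = pd.DataFrame({
    "value": [0.210230, 0.087266, 0.132306, 0.222883, 2.301015],
    "params_n_layers": [2, 1, 2, 1, 3],
    "state": ["COMPLETE"] * 5,
})

# Lowest objective value among completed trials for each depth.
best_per_depth = (
    head.query("state == 'COMPLETE'")
        .groupby("params_n_layers")["value"]
        .min()
)
print(best_per_depth)
```

With the real study, head would be replaced by study.trials_dataframe(). Five trials are far too few to draw conclusions from, so this only illustrates the mechanics.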

From the following contour plot, we see that a low batch size is generally good, with high values of dropout, learning rate, and weight decay, and only a single hidden layer. From the above parallel plot, a hidden layer of size around 90 looks good.

fig = optuna.visualization.plot_contour(study, params=['batch_size', 'lr', 'n_layers', 'weight_decay', 'dropout_rate'])
fig.update_layout(autosize=False, width=1200, height=1200)
plot_html(fig)
fig = optuna.visualization.plot_contour(study, params=['batch_size', 'lr'])
fig.show()
optuna.visualization.plot_optimization_history(study)

Appendix: Hyperparameters of commonly used models


../_images/hyp.png

Fig. 9 Table from p. 184 of [Tha20]. RS\(^*\) implies random search should be better.


1

Like all applied machine learning solutions.

2

See Optuna dashboard which displays the same plots that are updated in real-time.

3

In practice, we save the best model parameters at this point.